High Performance Document Layout Analysis
نویسنده
چکیده
In this paper, I summarize research in document layout analysis carried out over the last few years in our laboratory. Correct document layout analysis is a key step in document capture conversions into electronic formats, optical character recognition (OCR), information retrieval from scanned documents, appearance-based document retrieval, and reformatting of documents for on-screen display. We have developed a number of novel geometric algorithms and statistical methods. Layout analysis systems built from these algorithms are applicable to a wide variety of languages and layouts, and have proven to be robust to the presence of noise and spurious features in a page image. The system itself consists of reusable and independent software modules that can be reconfigured to be adapted to different languages and applications. Currently, we are using them for electronic book and document capture applications. If there is commercial or government demand, we are interested in adapting these tools to information retrieval and intelligence applications.
منابع مشابه
Document image analysis with cooperative interaction between layout analysis and logical structure analysis
When a printed document is to be input to a computer system, the document must be converted to a computer-readable format, e.g., ASCII, PDF, RTF, CSV, or SGML/XML/HTML-tagged data. In order to obtain these data formats from a printed document, it is necessary to extract from the printed document as much information as possible, i.e., layout structure (layout objects and their hierarchical relat...
متن کاملBayesian Learning of 2d Document Layout Models for Preservation Metadata Extraction
Digital preservation addresses the storage, maintenance, accessibility, and technical integrity of digital materials over the long term. Preservation metadata is the information required to perform these tasks. Given the volume of these journals and high labor cost of manual metadata entry, automated metadata extraction is necessary. Document layout analysis is a process of partitioning documen...
متن کاملDocument Layout Structure Extraction Using Bounding Boxes of Diierent Entities
This paper presents an eecient and accurate technique for document page layout structure extraction and classiication by analyzing the spatial connguration of the bounding boxes of diierent entities on a given image. The text, table, and nontext structures are detected on document images. The text-lines and words are extracted and the tabular structure is further decomposed into row and column ...
متن کاملAn Ultra High CMRR Low Voltage Low Power Fully Differential Current Operational Amplifier (COA)
this paper presents a novel fully differential (FD) ultra high common mode rejection ratio (CMRR) current operational amplifier (COA) with very low input impedance. Its FD structure that attenuates common mode signals over all stages grants ultra high CMRR and power supply rejection ratio (PSRR) that makes it suitable for mixed mode and accurate applications. Its performance is verified by HSPI...
متن کاملDocument structure analysis algorithms: a literature survey
Document structure analysis can be regarded as a syntactic analysis problem. The order and containment relations among the physical or logical components of a document page can be described by an ordered tree structure and can be modeled by a tree grammar which describes the page at the component level in terms of regions or blocks. This paper provides a detailed survey of past work on document...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2003